Advanced Graphics

This course aims to produce professional, publication-quality graphics.

The main package used here is called ggplot2: An Implementation of the Grammar of Graphics [how to write a graphic grammar sentence]

Points to be covered in this session

  1. Whiskers and box plot
  2. Whiskers and box plot overlaid with dot plot
  3. Violin plot
  4. Scatter plot
  5. Introducing faceting
  6. Line plot
  7. Error bars
  8. Histograms, histogram overlaid with density curve
  9. Density plot

Data loading:

Most likely, your data is in an Excel file. If so, the easiest way is to transfer your data from Excel to R through the clipboard.

To do that:

  1. Paste this code in the console of R
data <- read.table(file="clipboard",header=TRUE,sep="\t")
  2. Open your Excel sheet, highlight your data and press Ctrl+C to copy it
  3. Put your cursor on the code from step 1 and press Ctrl+Enter
  4. Your data should now be stored in R
  5. To confirm, click “Environment” in the upper-right panel of RStudio; you should see your data there.
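
To see what read.table does with the copied cells, here is a small sketch that parses tab-separated text directly. The `copied` string and its column names are made up for illustration; they stand in for what Excel places on the clipboard. Note that file="clipboard" is known to work on Windows; on macOS, `pipe("pbpaste")` can be used in its place.

```r
# Hypothetical stand-in for cells copied from Excel (tab-separated, one row per line)
copied <- "sample\tgroup\tvalue\nS1\tcontrol\t2.5\nS2\ttreated\t3.1\n"

# Same arguments as in step 1, but reading from the string instead of the clipboard
dat <- read.table(text = copied, header = TRUE, sep = "\t")

str(dat)  # two rows, three columns: sample, group, value
```

If the column names or values look wrong here, the usual culprits are a missing header row or a separator other than a tab.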

Now we need to install and load the ggplot2 package. The package can be downloaded directly from the internet or installed from zip/tar.gz files stored on your local drive.

From the internet: in RStudio, click Install, select “Install from: Repository”, type the package name “ggplot2”, and make sure “Install dependencies” is checked.

From local disk: instead of selecting install from repository, select package archive file and then select your zip file.

Packages required

library(ggplot2) # the main player
library(reshape2) # reshaping and melting your data from wide to long
library(RColorBrewer) # color your graph in an artistic way, like Leonardo da Vinci :)
library(scales) # rescaling your axes
library(plyr) # data manipulation
library(dplyr) # data manipulation

Preparing the dataset

In this tutorial, I will use the iris dataset. This dataset ships with R in the datasets package.

head(iris)# checking the first 6 rows
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
str(iris)# check the dataset parameters
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris)# have a quick look at the min, max, mean, etc.
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Let's make another dataset called iris1 without the Species column

iris1 <- iris[,1:4]# selecting only columns from 1-4 and name the dataset iris1

1. Whiskers and box plot

head(iris1)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1          5.1         3.5          1.4         0.2
## 2          4.9         3.0          1.4         0.2
## 3          4.7         3.2          1.3         0.2
## 4          4.6         3.1          1.5         0.2
## 5          5.0         3.6          1.4         0.2
## 6          5.4         3.9          1.7         0.4

This table format is not appropriate for plotting. It is called wide format, and we need to change it into long format. To do that, we will use the melt function (reshape2 package).

iris2 <- melt(iris1)
## No id variables; using all as measure variables
head(iris2)
##       variable value
## 1 Sepal.Length   5.1
## 2 Sepal.Length   4.9
## 3 Sepal.Length   4.7
## 4 Sepal.Length   4.6
## 5 Sepal.Length   5.0
## 6 Sepal.Length   5.4
tail(iris2)
##        variable value
## 595 Petal.Width   2.5
## 596 Petal.Width   2.3
## 597 Petal.Width   1.9
## 598 Petal.Width   2.0
## 599 Petal.Width   2.3
## 600 Petal.Width   1.8
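
For comparison, base R can produce a similar long table without reshape2. This is only an alternative sketch, not the approach used in the rest of the tutorial; the name `iris_long` and the renaming step are choices made here for illustration.

```r
iris1 <- iris[, 1:4]  # the four numeric columns, as above

# stack() puts all values in one column and the source column name in another
iris_long <- stack(iris1)
names(iris_long) <- c("value", "variable")        # stack() calls them "values" and "ind"
iris_long <- iris_long[, c("variable", "value")]  # same column order as melt()

head(iris_long)
dim(iris_long)  # 600 rows: 150 flowers x 4 measurements
```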

Let's draw the plot

ggplot(iris2, aes(x=variable, y=value)) + 
  geom_boxplot(notch=FALSE, width=0.5) +
  theme_bw() +
  theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2))

With notch

ggplot(iris2, aes(x=variable, y=value)) + 
  geom_boxplot(notch=TRUE, width=0.5) +
  theme_bw() +
  theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2))

To understand the box plot, here is how to read its parts.

The box shows the interquartile range (IQR): it spans from the 25th percentile (Q1) to the 75th percentile (Q3), so the middle 50% of your data points fall inside it.

The whiskers extend from the box to the most extreme data points that lie within 1.5 times the IQR above Q3 and within 1.5 times the IQR below Q1. If the data come from a normal distribution, about 99.3% of the points fall within these limits; points beyond them are drawn as outliers.

The line shows the median of the data.

The notch displays a confidence interval around the median, which is normally based on the median +/- 1.57 x IQR/sqrt(n) (the ggplot2 documentation gives the factor as 1.58).
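
As a rough numerical check of the description above, the sketch below computes the same quantities for the Sepal.Length column of iris with base R. The variable names are illustrative, and the factor 1.58 follows the ggplot2 documentation for notches.

```r
x <- iris$Sepal.Length
n <- length(x)

q1  <- unname(quantile(x, 0.25))  # lower edge of the box
med <- median(x)                  # line inside the box
q3  <- unname(quantile(x, 0.75))  # upper edge of the box
iqr <- q3 - q1                    # height of the box

# Limits beyond which points are drawn as outliers
lower_limit <- q1 - 1.5 * iqr
upper_limit <- q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points inside those limits
whiskers <- range(x[x >= lower_limit & x <= upper_limit])

# Approximate confidence interval around the median shown by the notch
notch <- med + c(-1, 1) * 1.58 * iqr / sqrt(n)

c(q1 = q1, median = med, q3 = q3, iqr = iqr)
whiskers
notch
```

The quartiles agree with the summary(iris) output shown earlier (1st Qu. 5.1, Median 5.8, 3rd Qu. 6.4).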


OK, back to our code. Let's tweak it more.

ggplot(iris2, aes(x=variable, y=value, fill=variable)) + 
  geom_boxplot(notch=TRUE, width=0.5) +
  theme_bw() +
  theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
  xlab("Flower part") +
  ylab("measured value")


2. Whiskers and box plot overlaid with dot plot

Let's read the iris3 dataset. This dataset is in long format but contains an additional column called “new”, which holds nominal values from 1 to 5.

iris3 <- read.csv("files/iris3.csv",header=TRUE)# loading the dataset from a csv file
str(iris3)# see that new col is integer
## 'data.frame':    600 obs. of  3 variables:
##  $ variable: chr  "Petal.Length" "Petal.Length" "Petal.Length" "Petal.Length" ...
##  $ value   : num  1 1 1.1 1.1 1.2 1.2 1.2 1.2 1.3 1.3 ...
##  $ new     : int  3 3 5 3 5 3 3 3 3 5 ...
unique(iris3$new)#calling unique values in new column
## [1] 3 5 1 2 4
#reset graphical device
dev.off()
## null device 
##           1
ggplot(iris3, aes(x=variable, y=value,fill=as.factor(new))) + 
  geom_boxplot(outlier.colour=NA, width=.7,notch=T,fill="gray90")+
  theme_bw() +
  geom_dotplot(binaxis="y", binwidth=0.04, stackdir="center") +
  theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
  xlab("Flower part") +
  ylab("Values") + 
  guides(fill=guide_legend(title="Rank of value"))

I can move the legend anywhere. Note the difference in the code.

ggplot(iris3, aes(x=variable, y=value,fill=as.factor(new))) +
  geom_boxplot(outlier.colour=NA, width=.7,notch=T,fill="gray90") +
  theme_bw() +
  geom_dotplot(binaxis="y", binwidth=0.04, stackdir="center") + 
  theme(text = element_text(size=20, face="bold", colour="black"), axis.text.x=element_text(vjust=2)) + 
  xlab("Flower part") + 
  ylab("Values") + 
  theme(legend.position='inside', legend.position.inside = c(1, 1), legend.justification=c(1,1)) + 
  guides(fill=guide_legend(title="Rank of value"))

This graph adds a third variable to your plot: besides x and y, you now have z, represented by the fill color of the dots (here, the rank of the value).

Figure output should be like this:

[Figure: box plot overlaid with dot plot]

3. Violin plot

A violin plot is a method of plotting numeric data. It is similar to a box plot, with a rotated kernel density plot on each side.
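
To make the "rotated kernel density" idea concrete, here is a small base R sketch. It estimates the density of one iris column with `density()` and mirrors it around a vertical axis, which is roughly the outline a violin plot draws. This is an illustration only, not how geom_violin is implemented internally.

```r
x <- iris$Sepal.Length

# Kernel density estimate: d$x are positions along the data axis, d$y the density there
d <- density(x)

# Mirror the density left and right of a central axis to get a violin-like outline
plot(c(-d$y, rev(d$y)), c(d$x, rev(d$x)), type = "l",
     xlab = "density (mirrored)", ylab = "Sepal.Length")
abline(v = 0, lty = 2)  # central axis of the violin
```

The wider the outline at a given height, the more observations sit around that value.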

Let's use the iris3 dataset.

ggplot(iris3, aes(x=variable, y=value)) + 
  geom_violin(fill="gray") +
  geom_boxplot(width=0.2, fill="black", outlier.colour=NA) +
  stat_summary(fun=median, geom="point", fill="white", shape=21, size=4) +
  theme_bw() +
  theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2))

#reset graphical device
dev.off()
## null device 
##           1

4. Scatter plot

diamonds is a dataset with the prices of more than 50,000 round-cut diamonds, built into the ggplot2 package. Let's look at it.

head(diamonds,n=10)# calling first 10 rows
## # A tibble: 10 × 10
##    carat cut       color clarity depth table price     x     y     z
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
##  1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
##  2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
##  3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
##  4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
##  5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
##  6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
##  7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
##  8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
##  9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
## 10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Let's make a scatter plot of carat against price

ggplot(diamonds, aes(x=carat,y=price)) +
  geom_point(size=5,alpha=0.5) +
  theme_bw() +
  theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))

Let's add another piece of information to the scatter plot: the color. I will fill each point according to the color recorded for it in the table.

First, let's see what the diamond colors are.

unique(diamonds$color)# calling unique values in color column
## [1] E I J H F G D
## Levels: D < E < F < G < H < I < J

So in diamonds there are 7 colors.

ggplot(diamonds, aes(x=carat,y=price,fill=color)) +
  geom_point(shape=21,size=5,alpha=0.5) +
  theme_bw() +
  theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))

Note the change in the code: I added fill=color, and I also changed the shape of the dot to 21, which can be filled with color. Here color is a discrete variable, but R can also color by continuous values. Let's see how.

Let's say we would like to color the points based on the table column of the diamonds dataset (the width of the top of the diamond relative to its widest point).

ggplot(diamonds, aes(x=carat,y=price,fill=table)) +
  geom_point(shape=21,size=5,alpha=0.5) +
  theme_bw() +
  theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))

Note that the legend changed to a continuous color scale.

Also, geom_rug adds a rug to the margins of the graph, which shows the density of the data along each axis.

ggplot(diamonds, aes(x=carat,y=price,fill=table)) +
  geom_point(shape=21,size=5,alpha=0.5) +
  geom_rug(position="jitter", linewidth=.01) +
  theme_bw() +
  theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))

Well, so far so good ?!?!


5. Faceting

In some circumstances we want to plot the relationship between a set of variables in multiple subsets of the data, with the results appearing as panels in a larger figure. This is known as a facet plot, and it is a very useful feature of ggplot2. The faceting is defined by one or more categorical variables. Each panel corresponds to one value of that variable.
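
Conceptually, faceting splits the data by a categorical variable and draws one panel per subset. The base R sketch below shows only the splitting step, using iris and its Species column as a stand-in; the name `panels` is illustrative.

```r
# Split the rows of iris into one data frame per species
panels <- split(iris, iris$Species)

names(panels)         # one entry per level of the faceting variable
sapply(panels, nrow)  # number of rows that would be drawn in each panel
```

ggplot2 does this splitting for you when you add a facet layer, so you never need to build the subsets by hand.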

Let's go back to our plot of carat versus price, filled by color:

ggplot(diamonds, aes(x=carat,y=price,fill=color)) +
  geom_point(shape=21,size=5,alpha=0.5) +
  theme_bw() +
  theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))

head(diamonds)
## # A tibble: 6 × 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Note that this plot shows the whole carat-price relationship, with color. The data are not separated by cut; in other words, you cannot get any information about the cut from this graph. Faceting subsets the data into several plots based on another variable.

Let's see. I will split this graph into several graphs based on the cut.

ggplot(diamonds, aes(x=carat,y=price,fill=color)) +
  geom_point(shape=21,size=5,alpha=0.5) +
  theme_bw() +
  theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
  facet_wrap(~cut,nrow=2)

With geom rug

ggplot(diamonds, aes(x=carat,y=price,fill=color)) +
  geom_rug(position="jitter", linewidth=.01) +
  geom_point(shape=21,size=5,alpha=0.5) +
  theme_bw() +
  theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
  facet_wrap(~cut,nrow=2)

R has powerful color palettes; there is a package called “RColorBrewer”. Of course you can define your own palette, but I am not discussing that right now.

display.brewer.all()

Let's change these fancy colors to different ones. But before that, I will save the previous graph code as “graph1”.

graph1 <- ggplot(diamonds,aes(x=carat,y=price,fill=color)) +
  geom_rug(position="jitter", linewidth=.01) +
  geom_point(shape=21,size=5,alpha=0.5) +
  theme_bw() +
  theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
  facet_wrap(~cut,nrow=2)

Typing graph1 will generate the graph

graph1 + scale_fill_brewer(palette="Purples")# i will choose purple coloring, i like purple!

We can see that colorless diamonds (the more expensive ones) are mostly found at smaller carat values and are rare among larger diamonds. Makes sense!

Do you know why diamond is expensive? Its chemical formula is “C”, pure carbon.

$30.6 sold in 2013 [the largest and most expensive diamond]


In a scatter plot, you can control the shape and size of the points based on a certain variable. Let's see.

I will plot carat vs price and change the point size based on the x column (length).

ggplot(diamonds, aes(x=carat,y=price,size=x)) +
  geom_point(shape=21,alpha=0.5) +
  theme_bw() +
  theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))

Note that I removed the point size from geom_point.

Size works better for continuous variables.

Or I can change the shape instead.

ggplot(diamonds, aes(x=carat,y=price,shape=cut)) +
  geom_point(size=3.5) +
  theme_bw() +
  theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
## Warning: Using shapes for an ordinal variable is not advised

I can change the y axis to a log scale. This requires a package called scales.

graph2 <- ggplot(diamonds,aes(x=carat,y=price,fill=color)) +
  geom_point(shape=21,size=5,alpha=0.5) +
  theme_bw() +
  theme(text = element_text(size=15, face="bold", colour="black"),axis.text.x = element_text(vjust=2))
graph2 + scale_y_continuous(trans=log2_trans())# log 10 works also

This turns y to log with visually diminishing spacing [requires the scales package]

graph2 + coord_trans(y="log2")

This is another tweak that gives the y axis a more scientific look.

graph2 + 
  scale_y_continuous(trans = log2_trans(),breaks = trans_breaks("log2", function(x) 2^x), labels = trans_format("log2", math_format(2^.x)))


6. Line plot

Let's use the mtcars dataset (datasets package)

head(mtcars,n=10)
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4

Now, let's plot a line graph with mpg as x and disp as y

ggplot(mtcars,aes(x=mpg,y=disp)) +
  geom_line() # simple base code

# note that x data are continuous here

R understands here that your input data is numeric (continuous). In some cases you may need to tell R that your data is categorical (discrete). Let's see how for the same graph.

ggplot(mtcars,aes(x=factor(mpg),y=disp,group=1)) +
  geom_line()

Can you see the difference?
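
One way to see why the two plots differ: `factor()` turns each distinct number into a category, and categories are placed at equally spaced positions on the axis regardless of their numeric value. A tiny sketch with made-up numbers:

```r
x <- c(10, 20, 50)   # made-up values with unequal gaps

f <- factor(x)
levels(f)            # the categories, stored as text labels
as.integer(f)        # their positions on a discrete axis

diff(x)              # gaps on a continuous axis
diff(as.integer(f))  # gaps on a discrete axis are all equal
```

So with x=factor(mpg), neighbouring mpg values sit one step apart on the axis even when their numeric difference is large or small.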

We can tweak the code by adding points

ggplot(mtcars,aes(x=factor(mpg),y=disp,group=1)) +
  geom_line() +
  geom_point()

In a line plot, we may need to plot different groups with different colors. Let's do that.

head(mtcars,n=3)
##                mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4     21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710    22.8   4  108  93 3.85 2.320 18.61  1  1    4    1

Suppose we want to plot mpg and disp grouped by the gear column. We can do that like this:

ggplot(mtcars,aes(x=mpg,y=disp,colour=factor(gear))) +
  geom_line() +
  geom_point()

# note that you need to tell R that gear is a factor so that each gear level is plotted as its own line

As I showed earlier, you can control color and shape to add more information. I will add the information from the am column by changing the shape of the points.

ggplot(mtcars,aes(x=mpg,y=disp,colour=factor(gear),shape=factor(am))) +
  geom_line() +
  geom_point(size=4)

Control the line width from geom_line

ggplot(mtcars,aes(x=mpg,y=disp,colour=factor(gear),shape=factor(am))) +
  geom_line(linewidth=1.5) +
  geom_point(size=4)


7. Error bars

In order to understand how to draw a graph with error bars, let's create a simple data frame

a=c("a","a","a","a","b","b","b","b","b","c","c","c","c","c","c")
b=c(1,2,3,4,5,6,4,4,1,2,3,4,5,6,7)
c=c(23,32,23,34,56,13,12,13,13,24,56,23,21,12,31)
d=c(23,43,54,54,56,67,65,34,15,67,87,65,43,46,45)
f=c("m","f","m","f","m","m","f","f","f","m","f","m","f","m","f")
data=data.frame(a=a,b=b,c=c,d=d,f=f)
data
##    a b  c  d f
## 1  a 1 23 23 m
## 2  a 2 32 43 f
## 3  a 3 23 54 m
## 4  a 4 34 54 f
## 5  b 5 56 56 m
## 6  b 6 13 67 m
## 7  b 4 12 65 f
## 8  b 4 13 34 f
## 9  b 1 13 15 f
## 10 c 2 24 67 m
## 11 c 3 56 87 f
## 12 c 4 23 65 m
## 13 c 5 21 43 f
## 14 c 6 12 46 m
## 15 c 7 31 45 f

We need to summarize the data to include the mean, standard deviation and standard error. I will use the plyr package to get this information, then use it for the plot.

The ddply function splits a data frame, applies a function to each piece, and returns the results in a data frame.

Please take a moment to understand the code. I ask R to split the data based on a, then summarize the b column to generate the mean, median, standard deviation and standard error.

If the data include NA (missing) values, R won't be able to compute these summaries (such as the mean or SE). That is why I tell R to ignore any NA it finds (na.rm=TRUE), and why n counts only the non-missing values (“!is.na” means not NA).
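
A tiny sketch of the NA handling described above, using a made-up vector (the names `x`, `n` and `se` are illustrative):

```r
x <- c(1, 2, NA, 4)  # made-up values with one missing entry

mean(x)                # NA: a single missing value spoils the result
mean(x, na.rm = TRUE)  # the missing value is dropped before averaging

n  <- sum(!is.na(x))                 # count of non-missing values
se <- sd(x, na.rm = TRUE) / sqrt(n)  # standard error based on those values
c(n = n, se = se)
```

Using length(x) in place of sum(!is.na(x)) would count the missing entry too and make the standard error look smaller than it is.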

Let's see. I will call the summarized data frame “sum”.

sum <- ddply(data, c("a"), 
             summarise,
             mb = mean(b, na.rm=TRUE),
             medb=median(b, na.rm=TRUE), 
             sd = sd(b, na.rm=TRUE),
             n = sum(!is.na(b)),
             se = sd/sqrt(n))

Lets check it

sum
##   a  mb medb       sd n        se
## 1 a 2.5  2.5 1.290994 4 0.6454972
## 2 b 4.0  4.0 1.870829 5 0.8366600
## 3 c 4.5  4.5 1.870829 6 0.7637626
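
As a cross-check, the same summary can be sketched in base R. The data frame is rebuilt here (as `dat`, with only columns a and b) so that the snippet runs on its own; `summary_base` is an illustrative name, and this is an alternative to the ddply call, not a replacement for it.

```r
# Rebuild the grouping column a and the measured column b from the example above
dat <- data.frame(
  a = rep(c("a", "b", "c"), times = c(4, 5, 6)),
  b = c(1, 2, 3, 4, 5, 6, 4, 4, 1, 2, 3, 4, 5, 6, 7)
)

groups <- split(dat$b, dat$a)  # one numeric vector per level of a

summary_base <- data.frame(
  a  = names(groups),
  mb = unname(sapply(groups, mean, na.rm = TRUE)),
  sd = unname(sapply(groups, sd, na.rm = TRUE)),
  n  = unname(sapply(groups, function(v) sum(!is.na(v))))
)
summary_base$se <- summary_base$sd / sqrt(summary_base$n)

summary_base  # should agree with the mb, sd, n and se columns printed above
```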

Now, lets draw the graph

ggplot(sum, aes(x=a,y=mb,fill=a)) +
  geom_line(aes(group=1)) +
  geom_point(shape=21, size=5) +
  geom_errorbar(aes(ymin=mb-se,ymax=mb+se),width=.2) +
  theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2))

Note that you need to tell R to connect the points with a line by adding aes(group=1) in geom_line.

Plotting multiple lines with error bars

In many cases, we need to draw several lines for different groups. In the next example I will summarize the data as before, but with one addition: I will ask R to group based on two variables. The first variable represents the data points (as before), and the second variable represents the group for the different lines. I will call the output sum1.

sum1 <- ddply(data, c("a","f"), 
              summarise,
              mb = mean(b, na.rm=TRUE),
              medb=median(b, na.rm=TRUE), 
              sd = sd(b, na.rm=TRUE),
              n = sum(!is.na(b)),
              se = sd/sqrt(n))

Here is the graph. Note that I am telling R to group based on f, so each level of f gets its own line. Moreover, to avoid overlap between points at the same x position, I tell R to dodge each group by 0.3 with “position=position_dodge(.3)”.

ggplot(sum1, aes(x=a,y=mb,fill=a,group=f)) +
  geom_line(position=position_dodge(.3)) +
  geom_point(shape=21,size=5,position=position_dodge(.3)) +
  geom_errorbar(aes(ymin=mb-se,ymax=mb+se),width=.2,position=position_dodge(.3)) +
  theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
  ylab("b column data")

Of course I can change the line pattern based on the group

ggplot(sum1, aes(x=a,y=mb,fill=a,group=f,linetype=f)) +
  geom_line(position=position_dodge(.3)) +
  geom_point(shape=21,size=5,position=position_dodge(.3)) +
  geom_errorbar(aes(ymin=mb-se,ymax=mb+se),width=.2,position=position_dodge(.3)) +
  theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
  ylab("b column data")

Or the line color, line width, error bar width, etc. # play with the numbers

ggplot(sum1, aes(x=a,y=mb,fill=a,group=f,color=f)) +
  geom_line(position=position_dodge(.3),linewidth=1.5) +
  geom_point(shape=21,size=5,position=position_dodge(.3)) +
  geom_errorbar(aes(ymin=mb-se,ymax=mb+se),linewidth=1.5,width=0.2,position=position_dodge(.3)) +
  theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2)) +
  ylab("b column data")


8. Histograms

A histogram differs from a bar plot in one important respect. REMEMBER, a histogram always summarizes counts on the y axis.

Let's use the diamonds data.

A histogram takes a column, divides its range into bins, counts how many values fall in each bin, and plots the counts.

Let's draw a histogram of diamond prices.

ggplot(diamonds,aes(x=price)) +
  geom_histogram() # the base code

Understanding binwidth

When plotting a histogram, you need to know about binwidth, which is the width of the window in which values are counted.

Assume you have these numbers (1, 1.3, 1.6, 3, 4, 5.4). Setting binwidth to 1, values are counted in windows of width 1 (0-1, 1-2, 2-3, and so on), so the counts for this example will be

bin     count
1-2     3
2.1-3   1
3.1-4   1
4.1-5   0
5.1-6   1

Changing binwidth from 1 to 2 gives different counts

bin     count
1-3     4
3.1-5   1
5.1-7   1

The overall pattern should stay the same. OK, let's change the binwidth and see [with some tweaks]
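
The two count tables above can be reproduced with base R's `hist()`, which by default treats bins as closed on the right (a value of exactly 3 falls in the bin ending at 3). The names `counts_w1` and `counts_w2` are illustrative. Keep in mind that ggplot2 may position its bin edges differently (see the `boundary` and `center` arguments of geom_histogram), so its counts for the same binwidth are not guaranteed to match these.

```r
x <- c(1, 1.3, 1.6, 3, 4, 5.4)

# binwidth 1: bin edges at 1, 2, 3, 4, 5, 6
counts_w1 <- hist(x, breaks = seq(1, 6, by = 1), plot = FALSE)$counts
counts_w1

# binwidth 2: bin edges at 1, 3, 5, 7
counts_w2 <- hist(x, breaks = seq(1, 7, by = 2), plot = FALSE)$counts
counts_w2
```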

ggplot(diamonds,aes(x=price)) +
  geom_histogram(binwidth=100,fill="white",color="black")

hist <- ggplot(diamonds,aes(x=price)) +
  geom_histogram(binwidth=500,fill="gold",color="red")

In general, I don’t like the grey background of ggplot, so I decided to make my own settings

mytheme <- theme_bw() +
  theme(text = element_text(size=20, face="bold", colour="black"),axis.text.x = element_text(vjust=2))

So I can apply it to any named graph

hist + mytheme

So far so good ?

Again, I can separate the histogram output based on the “cut” in the diamonds data

head(diamonds,n=2)
## # A tibble: 2 × 10
##   carat cut     color clarity depth table price     x     y     z
##   <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1  0.23 Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43
## 2  0.21 Premium E     SI1      59.8    61   326  3.89  3.84  2.31
ggplot(diamonds,aes(x=price,fill=factor(cut))) +
  geom_histogram(color="black")

Or I can draw the graph as interleaved multiple histograms. Note that I told R to do that by adding position="dodge". Also note that alpha controls transparency.

ggplot(diamonds,aes(x=price,fill=factor(cut))) +
  geom_histogram(position="dodge",alpha=0.4,color="black")

Data can also be separated into different panels. Do you remember how? Faceting.

ggplot(diamonds,aes(x=price,fill=factor(cut))) +
  geom_histogram(alpha=0.4,color="black") +
  facet_wrap(~cut)

To draw a histogram overlaid with a density curve

ggplot(diamonds, aes(x=price, y=after_stat(density))) +
  geom_histogram(fill="cornsilk",color="grey50",linewidth=.2) +
  geom_density()
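
The reason for putting density and not count on the y axis is that a density curve integrates to 1, so the histogram has to be rescaled to a total area of 1 to share the same axis. A quick base R check of that rescaling, using iris as a stand-in dataset (the variable names are illustrative):

```r
x <- iris$Sepal.Length
h <- hist(x, plot = FALSE)

bin_width <- diff(h$breaks)

# On the density scale, bar height = count / (n * bin width)
manual_density <- h$counts / (length(x) * bin_width)
all.equal(manual_density, h$density)

# Total area of the bars (height x width) is 1
sum(h$density * bin_width)
```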

You can also facet them based on cut or color

ggplot(diamonds, aes(x=price, y=after_stat(density),fill=color)) +
  geom_histogram(color="grey50",linewidth=.2) +
  geom_density(alpha=0.1) +
  facet_wrap(~color)

End of session

The more you give, the more you get:)